Abstract. Many communication networks contain nodes which may
misbehave, thus incurring a cost to the network operator. We consider 
the problem of how to manage the nodes when the operator receives a 
payoff for every moment a node stays within the network, but where 
each malicious node incurs a hidden cost. The operator only has some 
statistical information about each node's type, and never observes the 
cost. We consider the case when there are two possible actions: removing 
a node from a network permanently, or keeping it for at least one more 
time-step in order to obtain more information. Consequently, the prob- 
lem can be seen as a special type of intrusion response problem, where 
the only available response is blacklisting. We first examine a simple
algorithm (HiPER) which has provably good performance compared to an
oracle that knows the type (honest or malicious) of each node. We then 
derive three other approximate algorithms by modelling the problem as a 
Markov decision process. To the best of our knowledge, these algorithms 
have not been employed before in network management and intrusion 
response problems. Through experiments on various network conditions, 
we conclude that HiPER performs almost as well as the best of these
approaches, while requiring significantly less computation. 



1 Introduction 

We consider a communication network which is being partially monitored by 
a network management system. The nodes in this network can be of one of 
two types: malicious (e.g. dropping packets), or honest. We assume that we 
have some tangible gain for every moment that an honest node remains in the 
network, while keeping malicious nodes in the network carries a cost, both of 
which are hidden from the system. The system maintains some statistics about 
the activity of each node, which enable it to approximately guess its type. These 
could be alerts gathered from some intrusion detection system (IDS). Another 
example would be data-sharing statistics from a peer-to-peer (P2P) network, 
which would help identify selfish nodes. We require a response mechanism that 
removes malicious nodes as soon as possible, without inadvertently removing 
honest nodes, with high probability. In this setting, immediately removing nodes
that seem suspicious is suboptimal: if a node is removed from the network then
no further information can be received from this node. Thus, it pays to delay 
removal until we collect more information about suspicious nodes. 

Our first contribution is a decision-theoretic approach based on distribution-free
high probability bounds. The bounds require very little prior information
and can be used to trade off the cost of removing honest nodes with that of 
keeping malicious nodes in the network for too long. We prove that this High 
Probability Efficient Response algorithm (HiPER) has low worst-case expected 
loss relative to an oracle which knows the type of every node. 

Our second contribution is a set of Bayesian decision-theoretic approaches 
that we derive by formalising the problem as a Markov decision process. These 
require some further assumptions. In particular, it is necessary to fully specify a 
structure and prior parameters for the underlying statistical model. In addition, 
making optimal decisions according to such models is in our case computationally 
intractable. Consequently, we consider some approximate algorithms. Of these, 
an optimistic approximation has similar performance to that of HiPER. An- 
other approximation, which performs finite lookahead, can obtain some further 
performance gain, at the cost of additional computation. 

The paper is organised as follows. In the remainder of this section we give 
some background, present the related work and describe our contributions.
Section 2 introduces notation, while Section 3 specifies the loss model. Section 4
presents the proposed HiPER algorithm as well as the bounds on the worst-case
expected loss. Section 5 describes the decision-theoretic approaches which
model the problem as an MDP and are used in the performance comparisons
with the HiPER algorithm, while Section 6 describes the evaluation experiments.
Finally, Section 7 concludes the paper. The appendix contains proofs of some
technical lemmas and provides some useful auxiliary results for completeness. 

1.1 Background 

The problem we consider falls within the scope of (statistical) decision theory. In
particular, the scenario we investigate can be reduced to the optimal
stopping problem [8], which can be modelled as an (unknown) Markov Decision
Process (MDP) [5] or as a (potentially unknown) Partially Observable MDP
(POMDP) [12].

More precisely, in our setting, the nodes can be one of two types: honest or 
malicious. However, we initially start out without knowing what type each node 
is. Consequently, we must gather data (observations) to reduce our uncertainty 
about their types. Unfortunately, we can only do so while a node remains within 
the network. However, the longer we maintain a malicious node in the network, 
the more loss we incur. Conversely, once we remove an honest node, we will 
obtain no more profit from it. Thus, the problem can be reduced to deciding at 
what time, or under which conditions, to remove a given node from the network, 
if at all. Consequently, our scenario can be seen as a type of optimal stopping 
problem. 

Optimal stopping problems can be seen as MDPs [8]. An MDP models the
interaction between an environment and an agent. The environment has a state, 
which changes over time and whose next value depends on the current state, as 
well as the current action taken by the agent. In addition, at every time-step, the 
agent receives a reward, which also depends on both the state and the agent's
action. Its goal is to maximise its total expected reward. Intrusion response
can be modelled as an unknown MDP, where each node in the network has an
unknown type (malicious or honest), and where the state describes the set of 
nodes which remain in the network. In our case, the responses that we make 
correspond to the actions of the agent. However, the reward we receive is hidden 
and depends on the unknown type of the nodes. Equivalently, we can view the 
nodes' type as a hidden state, in which case the intrusion response problem can 
be seen as a partially observable MDP (POMDP).¹

To cast our problem in this setting, we need the following elements: a) the 
prior probability for each node being honest or malicious; b) a known distribu- 
tion family for the observation distribution, conditioned on whether the node 
under consideration is honest or malicious; c) a planning algorithm that will de- 
termine our responses. In general, this can be quite demanding computationally, 
as the solution to either the unknown MDP or the POMDP problem requires 
performing planning in a large tree. Finally, since in our case the reward remains 
hidden from the agent, the problem is an extreme case of a partial monitoring 
problem [1].

1.2 Previous work 

The stopping problem has been extensively studied [5], while partial
monitoring games have also received a lot of attention recently [1].
However, to the best of our knowledge, the general hidden reward stopping 
problem has not been previously studied in the literature. On the other hand, 
the specific application we consider can be seen as a type of optimal intrusion 
response, for which there is a considerable body of work. 

Most of the previous research on intrusion response has concentrated on the
POMDP formalism. Indicative publications are those by Zonouz et al. [22], Zan
et al. [20] and Zhang et al. [21], which have all proposed intrusion response
mechanisms that model the process as a POMDP [18]. More precisely, Zonouz et al.
[22] proposed a Response and Recovery Engine (RRE) based on a game-theoretic 
response strategy against adversaries modelled as opponents in a non-zero-sum, 
two-player Stackelberg stochastic game. In each step of the game RRE chooses 
the response actions using an approximate POMDP solver. More precisely, using 
the most likely state (MLS) approximation, the POMDP is converted to 
a competitive Markov Decision Process (MDP), which is then solved using a 
look-ahead search (i.e. approximate planning). Zhang et al. [21] use the POMDP
to integrate low-level IDS alerts with high-level system states, while Zan et
al. [20] propose to solve the intrusion response problem as a factored POMDP
model. Additionally, they decompose the POMDP into small sub-POMDPs and 
compute the response policy using the MLS approximation technique. However,
in our case MLS as an approximation is too crude to be used, since it would
essentially result in a completely random policy, as there are only two possible
hidden states each node can be in. An entirely different approach, policy-gradient
methods, is employed by [9] in the context of combating denial-of-service attacks
in P2P networks. However, this approach requires observing the rewards, which
are in fact hidden in our case.

1 This is not contradictory, since unknown MDP problems are in fact a special case
of POMDPs [11] where the unknown parameters are seen as an unchanging, hidden
state.

1.3 Our contributions 

Our first proposed algorithm relies on bounds which require neither knowledge
of prior probabilities regarding the type of a node (honest or malicious) nor
known distributions for the observations corresponding to honest or malicious
nodes. We only need to know the mean of each of these distributions. Consequently,
it is substantially more lightweight than MDP solvers, since we take
decisions without performing explicit planning. Thus, it is more suitable for
resource-constrained environments such as wireless communication networks. We
analyse the expected loss of this algorithm, and show that it is not significantly
worse than that of an oracle which already knows each node's type.

Our second contribution is to derive three approximate MDP solvers for this
problem. In contrast to previous work, in our scenario the reward is never
observed by the algorithm.² This is necessary, since knowing that a node yielded a
positive reward would allow us to directly conclude that it is an honest node.
Furthermore, two of our MDP algorithms are fundamentally different from those
previously employed in the intrusion response literature, as we forego the
most-likely-state approximation commonly used in POMDP-based approaches. The
first approach we consider is a myopic approximation. This is equivalent to the
most likely state approximation and to a sequential probability ratio test under
some conditions. The second approach is a lightweight optimistic heuristic that
performs no planning, which is derived from upper bounds [10] on Bayesian
decision making in unknown MDPs. To our knowledge, this approach has not
been used in similar problems before. Finally, we consider online planning with
finite lookahead [15, 18]. This approach takes decisions which consider the impact
of all our possible future actions up to some horizon. While this approach has
not been considered for intrusion response problems before, we note that it has
been employed in other applications such as dialogue modelling [5], autonomous
underwater vehicle mapping [16], preference elicitation [3] and sensor
scheduling [12] in wireless sensor networks.

2 Preliminaries 

We consider a network composed of a set of nodes, which can be either honest or 
malicious. In this work, we ignore the specific network topology, beyond assuming 
that there is a reliable way to obtain some statistics from each node. We denote
by $\mathcal{Q}$ the set of all malicious nodes and by $\mathcal{U}$ the set of all honest nodes. We
consider that there is an entity $\mathcal{E}$ (for instance an Internet Service Provider (ISP)
or a network administrator) who gains some reward (gain) $g_{\mathcal{U}}$ for each moment
that an honest node remains in the network and incurs a cost (loss) $\ell_{\mathcal{Q}}$ for each
moment that a malicious node stays in the network. A node may be removed
by $\mathcal{E}$ at any time, for example through blacklisting. However, re-inserting a
removed node is not normally possible.

2 Although of course the reward is used in the experiments to measure performance.

We use $N$ to denote the (possibly random) time at which $\mathcal{E}$ removes a node
from the network. In addition, any honest node may leave the network at some
(random) time $H$. Specifically, we assume that an honest node may decide to
leave the network with some small probability $\lambda > 0$, independently over time.
Then it holds that $E[H] = 1/\lambda$.
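
If the per-step leaving probability is written $\lambda$, the expected leaving time of an honest node follows from the geometric distribution. The short derivation below is a sketch under one assumption that the text does not state explicitly: $H$ counts time-steps starting from 1.

```latex
\begin{align*}
P(H = k) &= \lambda (1-\lambda)^{k-1}, \qquad k = 1, 2, \ldots \\
E[H] &= \sum_{k=1}^{\infty} k \, \lambda (1-\lambda)^{k-1}
      = \frac{\lambda}{\bigl(1 - (1-\lambda)\bigr)^{2}}
      = \frac{1}{\lambda}.
\end{align*}
```

The second step uses $\sum_{k \ge 1} k r^{k-1} = 1/(1-r)^2$ for $|r| < 1$. For example, $\lambda = 0.01$ gives an expected stay of 100 time-steps.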

Assumption 1 We assume that each node has a fixed type (i.e. honest or
malicious) that does not change over time. The type is hidden from $\mathcal{E}$.

Not only does $\mathcal{E}$ not know the type of each node, but it also never observes
the rewards obtained or the cost incurred. However, at each time-step $t \in \mathbb{N}$ and for
each node $i$, $\mathcal{E}$ receives some information signal $x_{i,t} \in [0,1]$, characterising the
behaviour of node $i$ within time interval $t$.

Assumption 2 We assume that $x_{i,1}, \ldots, x_{i,t}$ are independent,³ but not
necessarily identically distributed, random variables and it holds:

$$E[x_{i,t} \mid \mathcal{Q}] = q, \qquad E[x_{i,t} \mid \mathcal{U}] = u. \tag{1}$$

It is important to note that, while the expected value is constant for all $t$, the observed
average $\frac{1}{t}\sum_{k=1}^{t} x_{i,k}$ for each node $i$ may initially be far from the expected value
for small $t$. The average, together with the total number of observations for each
node, forms a summary of the information received from each node. The relationship
between these quantities will be looked at more closely in the analysis.
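
The gap between the running average and its expected value can be illustrated with a small simulation. This is only a sketch: it assumes Bernoulli observations (the family used later for the POMDP model and the experiments), and the function name, the mean of 0.3 and the seed are illustrative choices, not values from the paper.

```python
import random

def running_averages(mean, steps, rng):
    """Running averages of `steps` Bernoulli(mean) observations."""
    total = 0.0
    averages = []
    for t in range(1, steps + 1):
        total += 1.0 if rng.random() < mean else 0.0
        averages.append(total / t)  # average of the first t observations
    return averages

rng = random.Random(0)  # fixed seed, for reproducibility only
avgs = running_averages(mean=0.3, steps=10_000, rng=rng)
print(avgs[0], avgs[9], avgs[-1])
```

After a single observation the average is necessarily 0 or 1, however close the true mean is to the middle of the interval; only after many observations does it settle near the mean. This is why a decision rule needs the number of observations as well as the average.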

Finally, we place no specific meaning on $q$ and $u$ in this work, as they are
application-dependent. In an intrusion response scenario (e.g. [13]), they could
be considered as the detection rate (DR) and the false alarm rate (FA),
respectively, of an employed intrusion detection system. Then $x_{i,t}$ would
correspond to alarm signals, with low and high values for innocent-looking and
suspicious behaviour respectively. Correspondingly, in a peer-to-peer scenario (e.g.
[17]), they could be fairness or reputation scores of each node.

In the remainder, we always refer to some arbitrary node in the network and 
thus make no distinction between nodes. This is because the algorithms that
we examine consider each node independently of the others. Consequently, the
following section analyses the expected loss for a single node of unknown type. 



3 This assumption could perhaps be relaxed if it is possible to form a martingale
difference sequence from the sequence of observations.

3 The loss model



As previously mentioned, $\mathcal{E}$ obtains a small gain for each time-step an honest
node is within the network, and a small loss for each time-step a malicious node
remains in the network. Formally, we can write that the total gain $G$ we obtain
from some node $i$, which $\mathcal{E}$ removes at time $N$, and which would voluntarily
leave at time $H$, is:

$$G(i, H, N) = \begin{cases} -N \ell_{\mathcal{Q}}, & i \in \mathcal{Q} \\ \min\{H, N\}\, g_{\mathcal{U}}, & i \in \mathcal{U}. \end{cases} \tag{2}$$

$\mathcal{E}$ wants to choose some node removal policy $\pi$ that maximises its total expected
gain. That means that $\mathcal{E}$ needs to keep as many honest nodes as possible in the
network and eliminate the nodes that behave maliciously. In our analysis, we
compare the expected gain of our policy $\pi$ with that of an oracle. The oracle
always knows the type of each node (i.e. honest or malicious), and thus employs
the optimal policy $\pi^*$. For $i \in \mathcal{Q}$, the optimal policy $\pi^*$ sets
$N = 0$, while for $i \in \mathcal{U}$ it sets $N = \infty$.
Correspondingly,

$$E_{\pi^*}[G \mid i] = \begin{cases} 0, & i \in \mathcal{Q} \\ g_{\mathcal{U}}\, E[H] = g_{\mathcal{U}}/\lambda, & i \in \mathcal{U}. \end{cases} \tag{3}$$




Let the loss $L$ be the difference between the gain of the optimal policy and our
policy. In particular, the expected loss of policy $\pi$ for a node of type $v$ is defined
as:

$$E_\pi[L \mid v] = E_{\pi^*(v)}[G \mid v] - E_\pi[G \mid v], \tag{4}$$

where the $i$ subscript has been dropped for simplicity. The expected loss is
bounded by the worst-case expected loss:

$$E_\pi[L] \le \max_{v \in \{\mathcal{Q}, \mathcal{U}\}} E_\pi[L \mid v], \tag{5}$$

which we wish to minimise. If $\mathcal{E}$ removes node $i$ from the network at random
time $N$, then it does not receive any more observations $x_{i,t}$ for this node from
the IDS. Thus, in essence, we want to find a stopping rule that will let $\mathcal{E}$
determine the random time at which stopping occurs, i.e. the time at which $\mathcal{E}$ decides
that $i \in \mathcal{Q}$ and removes it from the network. We note that $0 \le N \le \infty$, where
$N = \infty$ if stopping never occurs.
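
The gain of equation (2) and the resulting loss against the oracle can be written out for a single realisation of the leaving time. This is a minimal sketch: the function and argument names are illustrative, and it computes the realised loss for one given leaving time rather than the expectation over it.

```python
import math

def gain(malicious, leave_time, remove_time, g_u, l_q):
    """Total gain G(i, H, N) of equation (2) for one node."""
    if malicious:
        return -remove_time * l_q             # cost accrues until removal
    return min(leave_time, remove_time) * g_u  # gain accrues until exit

def oracle_gain(malicious, leave_time, g_u, l_q):
    """Gain of the oracle: N = 0 for malicious nodes, N = infinity otherwise."""
    remove_time = 0 if malicious else math.inf
    return gain(malicious, leave_time, remove_time, g_u, l_q)

def loss(malicious, leave_time, remove_time, g_u, l_q):
    """Realised loss: oracle gain minus the gain of the chosen removal time."""
    return (oracle_gain(malicious, leave_time, g_u, l_q)
            - gain(malicious, leave_time, remove_time, g_u, l_q))
```

For instance, with unit cost and a gain of 0.5 per step, keeping a malicious node for 10 steps costs 10, while removing an honest node 40 steps before it would have left costs 20.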

Since $\mathcal{E}$ does not know if node $i$ is honest or malicious, it must collect a
sufficient number of samples so as to only remove nodes for which it is reasonably 
certain that they are malicious. On the other hand, malicious nodes must be 
removed as soon as possible, since the operator incurs a cost for every moment 
they remain in the network. The first algorithm we consider uses simple statistics 
to make nearly optimal decisions about which nodes to keep. 

4 The HiPER Algorithm

The algorithm, depicted in Alg. 1, uses the knowledge we have about malicious
and honest nodes (see equation (1)). This is done by calculating the average of all
the observations generated by a node $i$ until time $t$:

$$\theta_t = \frac{1}{t} \sum_{k=1}^{t} x_{i,k}, \tag{6}$$

and adding an appropriate confidence interval so that errors are made with
low probability. Informally, HiPER keeps nodes in the network as long as the
statistic $\theta_t$ is sufficiently far from the expected statistic $q$ of malicious nodes. In
order to avoid throwing away honest nodes prematurely, it always keeps nodes
for a certain number of steps to obtain more reliable statistics. However, as time
passes, it needs more and more evidence to remove a node. Consequently, the
probability that an honest node is thrown out is bounded.

The analysis of the algorithm proceeds in three steps. First, we calculate the
expected loss of the algorithm when faced with a node of malicious type. Then,
we calculate the loss for honest nodes. Subsequently, we combine the two losses
and tune the algorithm's input parameters to obtain an overall loss bound.

Algorithm 1 HiPER Algorithm for Optimal Response

Parameters: $\delta$, $\Delta$, $q \in [0, 1]$
Loop: For each node $i$ in the network:
  For each time-step $t$ do:
    if $|\theta_t - q| < \sqrt{\cdots}$ and $t > \cdots$  [both threshold expressions are illegible in the source]
      remove node $i$ from the network
    else keep node $i$ in the network.
    end if
  end For
end For

The first bound only depends upon the input parameter $\delta$, the error probability
we wish to accept, and the loss $\ell_{\mathcal{Q}}$ incurred by malicious nodes. We prove
that the expected loss is polynomially bounded in terms of both $\delta$ and $\ell_{\mathcal{Q}}$.

Lemma 1. For Algorithm 1 with input parameter $\delta$, and $\Delta = |u - q|$, the
expected loss when the node is malicious is bounded as:

$$E[L \mid \mathcal{Q}] \le \frac{\ell_{\mathcal{Q}}}{(1-\delta)^2}. \tag{7}$$

The proof of this lemma can be found in the appendix. Naturally, the expected
loss is linearly dependent on the loss of keeping a malicious node in the network,
while the dependence on the error probability is quadratic.
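
The removal test of Algorithm 1 can be sketched in a few lines. The exact confidence width and minimum number of steps are not legible in this copy of the paper, so the sketch substitutes two assumptions: a Hoeffding-style width $\sqrt{\ln(2/\delta)/(2t)}$ and a caller-supplied `min_steps`. Both are stand-ins chosen to have the qualitative behaviour described above, not the authors' expressions, and the function name is illustrative.

```python
import math

def hiper_should_remove(observations, q, delta, min_steps):
    """HiPER-style removal test for one node (sketch, assumed thresholds).

    observations: signals x_1..x_t in [0, 1] seen so far for this node
    q:            expected signal of a malicious node
    delta:        error-probability parameter, 0 < delta < 1
    min_steps:    ASSUMED minimum number of observations before removal
    """
    t = len(observations)
    if t == 0 or t <= min_steps:
        return False  # always keep the node during the initial phase
    theta = sum(observations) / t  # running average, equation (6)
    # ASSUMPTION: Hoeffding-style width; shrinks as t grows, so later
    # removals need the average to be closer to q.
    width = math.sqrt(math.log(2.0 / delta) / (2.0 * t))
    return abs(theta - q) < width
```

With `q = 0.8`, `delta = 0.1` and `min_steps = 5`, ten observations averaging 0.8 trigger removal, while ten observations of 0 do not, since the average is then 0.8 away from `q` and the assumed width is about 0.39.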

The second bound depends on the input parameter $\Delta$, which corresponds to
how far we expect the statistics of honest nodes to be from $q$, the gain obtained
by honest nodes $g_{\mathcal{U}}$ and the leaving probability of honest nodes $\lambda$. Once more,
we obtain a polynomial loss bound in terms of those variables.

Lemma 2. If $\Delta = |u - q|$, then the expected loss when the node is honest is
bounded by:



The proof of this lemma can be found in the appendix. Similarly to the previous
lemma, there is a linear dependence on the loss that is incurred when we
erroneously remove an honest node, and a quadratic dependence on the rate of
departure. In addition, there is a weak dependence on the gap $\Delta$ between the
two means.

Finally, we can combine everything in one bound by selecting a value for $\delta$
that depends on $\Delta$ and which simultaneously makes the bounds tight:

Theorem 1. Set $\Delta = |u - q|$ and select:



then the expected loss $E[L]$ is bounded by:



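Proof. If we substitute (9) in (7) we get:

$$E[L \mid \mathcal{Q}] \le \frac{\ell_{\mathcal{Q}}}{(1-\delta)^2} = \frac{g_{\mathcal{U}}(\Delta^2 + 2)}{\lambda(\Delta^2 + 2\lambda)}.$$

Thus, using (5) and Lemmas 1 and 2 we get:

$$E[L] \le \frac{g_{\mathcal{U}}(\Delta^2 + 2)}{\lambda(\Delta^2 + 2\lambda)}.$$

□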



This theorem shows that the performance of HiPER only very weakly depends
on the gap $\Delta$ between honest and malicious nodes. In addition, it is optimal up
to a polynomial factor.

5 Modelling as a Partially Observable Markov Decision Process

A Partially Observable Markov Decision Process (POMDP) [12] is a generalisation
of a Markov Decision Process (MDP). More precisely, a POMDP models
the relationship between an agent and its environment when the agent cannot
directly observe the underlying state. A POMDP can be described as a tuple
$\langle S, A, O, T, \Omega, R \rangle$ where $S$ is a finite set of states, $A$ is a set of possible actions,
$O$ is a set of possible observations, $T$ is a set of conditional transition probabilities,
$\Omega$ is a set of conditional observation probabilities and $R : A \times S \to \mathbb{R}$ is the reward function.



5.1 Intrusion Response and POMDP 

We can model our intrusion response problem as a POMDP if we consider that
a node of the network at each time-step $t$ has a state $s_t \in S$ with $s_t = (v_t, c_t)$,
where $v_t \in \{0, 1\}$ and $c_t \in \{0, 1\}$ such that:

$$v_t = \begin{cases} 0, & \text{if the node is honest,} \\ 1, & \text{if the node is malicious,} \end{cases} \qquad c_t = \begin{cases} 0, & \text{if the node is in the network,} \\ 1, & \text{if the node is out of the network,} \end{cases}$$

where it holds that $P(v_{t+1} = v_t) = 1$ since $v_t$ is stationary (i.e. a malicious node
is always malicious and an honest node remains honest) based on Assumption 1.

Additionally, at each time-step $t$, $\mathcal{E}$ can perform an action $a_t \in \{0, 1\}$ such
that:

$$a_t = \begin{cases} 0, & \text{if $\mathcal{E}$ keeps the node in the network,} \\ 1, & \text{if $\mathcal{E}$ removes the node from the network.} \end{cases}$$



Furthermore, the following independence condition holds:

$$P(v_{t+1} \mid v_t, c_t, a_t) = P(v_{t+1} \mid v_t), \tag{12}$$

since the type of a node (i.e. malicious or honest) depends neither on $\mathcal{E}$'s action
(i.e. remove from the network or not) nor on whether the node is in the
network or out. In addition, since the type of a node never changes, it holds:

$$P(v_{t+1} = j \mid v_t = j) = 1. \tag{13}$$

Consequently, we remove the time subscript from $v$ in the sequel. On the other
hand, the probability that a node will be in the network depends on whether it is already
in or out and on the action that $\mathcal{E}$ will take:

$$P(c_{t+1} \mid c_t, v, a_t) = P(c_{t+1} \mid c_t, a_t). \tag{14}$$

From equations (12) and (14), it is evident that the POMDP under consideration
is factored.

To fully specify the model we must assume some probability distribution for
the observations. Specifically, we model $x_t$ as drawn from a Bernoulli distribution⁴
with parameters $u$ and $q$ for honest and malicious nodes respectively:

$$P(x_t = 1 \mid v = 0) = u \quad \text{and} \quad P(x_t = 1 \mid v = 1) = q. \tag{15}$$

Let $x^t = (x_1, \ldots, x_t)$ be a $t$-length sequence of observations. From Bayes'
theorem, we obtain an expression for our belief at time $t$:

$$P(v = j \mid x^t) = \frac{P(x^t \mid v = j)\, P(v = j)}{\sum_{k \in \{0,1\}} P(x^t \mid v = k)\, P(v = k)}, \tag{16}$$

where $j \in \{0, 1\}$. Thus, the expected gain at time $t$ if $\mathcal{E}$ decides to keep a node
in the network is:

$$E[G_t \mid c_t = 0, x^t] = P(v = 0 \mid x^t) \cdot g_{\mathcal{U}} - P(v = 1 \mid x^t) \cdot \ell_{\mathcal{Q}},$$

while the expected gain if $\mathcal{E}$ decides to remove the node from the network is
always:

$$E[G_t \mid c_t = 1] = 0.$$
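
The belief update and the expected one-step gain of keeping a node can be sketched as follows for the Bernoulli model of equation (15). The function names and the default prior of 0.5 are illustrative assumptions, and the sketch assumes $0 < u, q < 1$ so that the normalising constant is never zero.

```python
def posterior_malicious(observations, u, q, prior_malicious=0.5):
    """P(v = 1 | x^t) for binary observations, by Bayes' theorem.

    u, q: probability of observing 1 for honest / malicious nodes.
    prior_malicious: ASSUMED prior P(v = 1); the paper leaves it to the user.
    """
    like_honest = 1.0
    like_malicious = 1.0
    for x in observations:
        like_honest *= u if x == 1 else 1.0 - u
        like_malicious *= q if x == 1 else 1.0 - q
    joint_malicious = like_malicious * prior_malicious
    evidence = joint_malicious + like_honest * (1.0 - prior_malicious)
    return joint_malicious / evidence

def expected_keep_gain(p_malicious, g_u, l_q):
    """Expected one-step gain of keeping a node, given P(v = 1 | x^t)."""
    return (1.0 - p_malicious) * g_u - p_malicious * l_q
```

For long observation sequences the likelihood products underflow, so a practical version would accumulate log-likelihoods instead; the direct product is kept here because it mirrors the formula.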

The problem is to find a policy $\pi : X^* \to A$, mapping from the set of all possible
sequences of observations to actions, maximising the total expected gain
$E_\pi\big[\sum_t G_t\big]$.

Since future gains depend on any future observations we might obtain, the exact
calculation requires enumerating all possible future observations. Consequently,
the exact solution to the problem is intractable [8, 11, 10]. In the next section we
describe possible approximations to this problem.



5.2 POMDP algorithms 

We consider three algorithms: a) A myopic algorithm, which only considers the 
expected gain at the current time-step; b) An optimistic algorithm, which com- 
putes an upper bound on the total expected gain; c) A finite lookahead algo- 
rithm, which performs complete planning up to some fixed finite depth. While 
these algorithms have appeared before in the general MDP literature, they have 
not been applied before to intrusion response problems. We do not consider the 
most likely state approximation (MLS), since in our case there are only two pos- 
sible hidden states for a node, thus, rendering the approximation far too coarse 
for it to be effective. 

4 This distribution is particularly convenient for computational reasons, because 
closed-form Bayesian inference can be performed via the Beta conjugate prior [8]. 
However, in principle it can be replaced with any other distribution family, without 
affecting the overall formalism. 

Myopic. In this case, $\mathcal{E}$ only considers the expected gain for the next time-step
when taking a decision. Consequently, $\mathcal{E}$ keeps the node in the network if the
following condition holds:

$$E[G_t \mid a_t = 0] > E[G_t \mid a_t = 1]. \tag{18}$$

This algorithm is the closest to the MLS approximation among the ones considered.
In fact, it is easy to see that it would be identical to MLS, as well as to a
sequential probability ratio test, when $\ell_{\mathcal{Q}} = g_{\mathcal{U}}$.

Optimistic. This rule constructs an upper bound on the value of the decision to
keep a node in the network, which is based on Proposition 1 in [10]. Informally,
this is done by assuming that the true type of the node will be revealed at the
next time-step. Then $\mathcal{E}$ keeps the node in the network if and only if:

$$P(v_t = 0 \mid x^t) \cdot g_{\mathcal{U}}/\lambda > P(v_t = 1 \mid x^t) \cdot \ell_{\mathcal{Q}}. \tag{19}$$

Intuitively, if the node is revealed to be malicious, then we can remove it at the
next step and consequently we only lose $\ell_{\mathcal{Q}}$. In the converse case, we can keep
it for an expected $1/\lambda$ steps.
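
The myopic and optimistic rules differ only in how much future gain they credit to an honest node, which the following sketch makes explicit. Both functions take the current posterior probability that the node is malicious; the names are illustrative, and the myopic rule uses the fact that the expected gain of removal is zero.

```python
def myopic_keep(p_malicious, g_u, l_q):
    """Rule (18): keep iff the expected one-step gain of keeping exceeds 0."""
    return (1.0 - p_malicious) * g_u - p_malicious * l_q > 0.0

def optimistic_keep(p_malicious, g_u, l_q, leave_prob):
    """Rule (19): an honest node is credited with 1/leave_prob future steps,
    a malicious node with a single step of cost."""
    return (1.0 - p_malicious) * g_u / leave_prob > p_malicious * l_q
```

With equal gain and cost, a posterior of 0.6 and a leaving probability of 0.1, the myopic rule removes the node while the optimistic rule keeps it, since the possible long-run gain from an honest node outweighs one step of cost.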

Finite lookahead. The finite lookahead algorithm performs backwards induction
[5] up to some finite depth $T$, at every time-step. More precisely, any
sequence of observations $x^t = (x_1, \ldots, x_t)$ results in a posterior probability
$P(v_t \mid x^t)$. Let:

$$V_t \triangleq \sum_{k=t}^{\infty} G_k \tag{20}$$

be the total gain starting from time-step $t$. Then, the expected gain under the
optimal policy is determined recursively as follows:

$$E(V_t \mid x^t) = \max\{0,\; E(G_t \mid x^t, a_t = 0) + E(V_{t+1} \mid x^t)\} \tag{21}$$
$$E(V_{t+1} \mid x^t) = p_t\, E(V_{t+1} \mid x^t, x_{t+1} = 1) + (1 - p_t)\, E(V_{t+1} \mid x^t, x_{t+1} = 0) \tag{22}$$

where $p_t \triangleq P(x_{t+1} = 1 \mid x^t) = \sum_{j=0}^{1} P(x_{t+1} = 1 \mid v = j)\, P(v = j \mid x^t)$ is the
marginal posterior probability that $x_{t+1} = 1$. For more details on this backwards
induction algorithm, the reader is urged to consult [8, 11].
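
The recursion (21)-(22) can be sketched as a depth-limited recursive function over the belief state. This is a simplified illustration rather than the authors' solver: it truncates the value to zero beyond the lookahead depth, uses the Bernoulli model of equation (15), and, like the recursion as written, does not model honest nodes leaving. The function name is illustrative.

```python
def lookahead_value(p_mal, depth, u, q, g_u, l_q):
    """Depth-limited estimate of E(V_t | x^t), given p_mal = P(v = 1 | x^t)."""
    if depth == 0:
        return 0.0  # truncation: no gain is credited beyond the horizon
    keep_now = (1.0 - p_mal) * g_u - p_mal * l_q  # E(G_t | x^t, a_t = 0)
    p_one = q * p_mal + u * (1.0 - p_mal)         # p_t = P(x_{t+1} = 1 | x^t)
    future = 0.0
    for x, p_x in ((1, p_one), (0, 1.0 - p_one)):
        if p_x <= 0.0:
            continue  # this observation cannot occur
        like_mal = q if x == 1 else 1.0 - q
        p_next = like_mal * p_mal / p_x           # Bayes update of the belief
        future += p_x * lookahead_value(p_next, depth - 1, u, q, g_u, l_q)
    # Removing the node is worth 0, so keep only if the total is positive.
    return max(0.0, keep_now + future)
```

With `u = 0.2`, `q = 0.8`, unit gain and cost, and a belief of 0.55, one step of lookahead gives a value of 0 (remove), whereas two steps give about 0.15 (keep): the extra step accounts for the chance that the next observation clears the node. Each level doubles the number of recursive calls, which is the exponential cost in the depth noted in the text.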

6 Experimental Evaluation 

We perform three sets of experiments. The first set investigates the performance
of HiPER with various choices of the parameter $\delta$, including the optimal choice
suggested by Theorem 1. The second set compares HiPER with the myopic and
optimistic approximations. In the final set of experiments, we compare the optimistic
with the finite lookahead approximation. In all cases, we collected results
from 10 runs, with 100 nodes per run, and we plot a moving average of the
expected loss as various network parameters change. Specifically, the first results
we report (i.e. Fig. 1) are obtained from $10^4$ experiments. For each experiment,
we selected a horizon $H \sim \mathrm{Uniform}([10, 1000])$, user and adversary parameters
$u, q \sim \mathrm{Uniform}([0, 1])$, and user gain $g_{\mathcal{U}} \sim \mathrm{Uniform}([0, 1])$, and we set $\ell_{\mathcal{Q}} = 1$.
Each experiment measured the loss for a network containing 100 nodes, each
of which had a probability $p$ of being malicious, with $p \sim \mathrm{Beta}(2, 2)$ for each
experiment. During each run, the $i$-th node generates a sequence of observations
$x_{i,t}$ drawn from a Bernoulli distribution with parameter $u$ if the node is honest
and $q$ if the node is malicious. Fig. 1 shows a summary
of the results, averaged over these trials. It can be seen that, while HiPER's
performance is relatively robust to the choice of $\delta$, the optimal choice
suggested by Theorem 1 generally leads to small losses.

Fig. 1. Simulations with Alg. 1, for four different choices of $\delta$. In particular $\delta_1 = 0.9$,
$\delta_2 = 0.95$, $\delta_3 = 0.99$ and $\delta^*$ is chosen according to Theorem 1. It can be seen that,
while the algorithm is not extremely sensitive to the exact choice of $\delta$, the optimal
value is generally more robust.
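
The sampling of one experiment in the first set can be sketched as below. The distributions follow the description above; the function names, the dictionary layout and the choice of an integer-valued horizon are assumptions made for illustration.

```python
import random

def sample_experiment(rng, n_nodes=100):
    """Draw the parameters and node types for one simulated experiment."""
    params = {
        "horizon": rng.randint(10, 1000),      # ASSUMED integer horizon
        "u": rng.random(),                     # honest-node signal mean
        "q": rng.random(),                     # malicious-node signal mean
        "g_u": rng.random(),                   # gain per step, honest node
        "l_q": 1.0,                            # cost per step, malicious node
        "p_malicious": rng.betavariate(2, 2),  # per-experiment malicious rate
    }
    # True marks a malicious node, False an honest one.
    types = [rng.random() < params["p_malicious"] for _ in range(n_nodes)]
    return params, types

def observe(rng, malicious, params):
    """One Bernoulli observation for a node of the given type."""
    mean = params["q"] if malicious else params["u"]
    return 1 if rng.random() < mean else 0
```

A full run would then feed `observe` into a removal rule for each node at each time-step up to the horizon and accumulate the loss against the oracle.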

For our second set of experiments, shown in Figure 2, we compare HiPER
with the optimistic and myopic algorithms. We increased the range of user gains
to $g_{\mathcal{U}} \sim \mathrm{Uniform}([0, 2])$ compared to the previous setup, but the other experimental
parameters remain the same. It is clear that the myopic approximation
almost always has a higher loss than both HiPER and the optimistic
algorithm. The latter, while performing at a similar level to HiPER, has an
advantage when either the proportion of malicious nodes is small or when $g_{\mathcal{U}}$ is large.
This makes sense intuitively, since in those cases the optimism is justified. In the
converse case, however, the optimistic approach performs worse than HiPER,
which is less sensitive to the proportion of malicious nodes, since it is a worst-case
approach.

Fig. 2. Comparison of HiPER with the myopic solver and the optimistic approximation
for various network conditions. It can be clearly seen that the myopic approximation
is significantly worse than both approaches. However, the optimistic approach
outperforms the worst-case HiPER algorithm when the proportion of malicious nodes is low.
The optimistic approach is also better when the payment for honest nodes is high.

Finally, we performed some experiments comparing the optimistic approximation
with the finite-lookahead POMDP solvers, with a lookahead of $T$ time-steps
where $T \in \{4, 8\}$. While these do not solve the problem to the end of the horizon
$H$, they plan ahead for $T$ steps at every time-step of the simulation. Unfortunately,
the complexity of these solvers is exponential in $T$, which limited the
number of simulations we could perform to $10^3$, and we only considered horizons
$H \sim \mathrm{Uniform}([1, 100])$. These experiments are shown in Fig. 3. In comparison
with Fig. 2, the finite lookahead algorithms perform much better than the
myopic approximation and indeed the 8-step lookahead manages to slightly outperform
the optimistic approximation. In addition, it is much more robust to the
proportion of malicious nodes in the network. However, the relative advantage
of the 8-step over the 4-step lookahead is relatively small for the amount of extra
computation required.⁵

Fig. 3. Comparison of the optimistic approximation with approximate non-myopic
POMDP solvers for planning lookahead of $T$ time-steps where $T \in \{4, 8\}$. It can
be seen that, for short horizons, these perform just as well and that they are more
robust to the proportion of malicious nodes in the network. However, these methods
are computationally more intensive, with complexity $O(e^T)$.

7 Conclusion 

This paper defined a network management problem that arises frequently in
communication networks: namely, whether to remove a suspicious node from
the network given the available evidence, or to collect some further
data before taking the final decision. This is in fact a type of stopping problem,
which we believe is of relevance to many networking applications where blacklisting
may be performed. This includes applications such as automated intrusion
response, as well as ensuring fairness in peer-to-peer networks, such as [19]. To
this end, we proposed and analysed, both theoretically and experimentally, a
simple algorithm, HiPER, that achieves low worst-case expected loss relative to
an oracle that knows a priori the type (honest or malicious) of every node in
the network. In addition, we derived and compared a number of algorithms by
modelling the problem as a POMDP: a myopic and an optimistic approximation,
as well as a finite lookahead solver. Of those, the optimistic approximation and
the partial finite lookahead solvers perform the best, with the finite lookahead
methods being the most robust, while also being the most computationally
demanding.

5 The computational effort is exponential in T.

The main advantages of HiPER are its simplicity and its lack of stringent
assumptions on the distribution. This makes it suitable for deployment in most
situations. However, in certain cases a full probabilistic model and sufficient
computational resources are available, in which case one of the approximate
solvers would be useful. The myopic approximation, which is almost equivalent
to the widely-used MLS approximation, performs the worst. The overall best
performance is offered by the finite lookahead. To our knowledge, neither the
optimistic approximation nor the finite lookahead methods have been applied
before to this problem or, more generally, to intrusion response problems. They
should be applicable to other types of intrusion response and network
management problems. It is our view that they are inherently more suitable than
other approximations such as the most likely state (MLS) approximation (or,
equivalently, a sequential probability ratio test), which in our setting
produces an essentially random policy.

For future work, we would like to extend our theoretical analysis to the
performance of the optimistic and the finite lookahead algorithms. In addition,
it would be interesting to examine more general game-theoretic scenarios,
including strategic attackers [21], as well as colluding nodes. Finally, we
would like to generalise our setting so that observations must be explicitly
gathered from each node, where it is not possible to continuously sample all
nodes due to budget constraints. In fact, the sampling problem in the context
of intrusion detection has recently been studied by [14]. A natural extension
of our work would consequently be to optimally combine sampling and response
policies.



A Proofs 

This section collects the missing proofs from the main text. 

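Proof (of the lemma for malicious nodes). Since the node $i$ under
consideration is malicious, i.e. $i \in \mathcal{Q}$, it holds that
$\mathbb{E}[x_{i,t} \mid \mathcal{Q}] = q$. From Hoeffding's inequality
(Lemma 3 in the Appendix), we have:

\[
  \mathbb{P}(|\theta_t - q| > \epsilon_t \mid \mathcal{Q})
    \le 2 \exp(-2 t \epsilon_t^2),
  \tag{23}
\]

where $\epsilon_t > 0$ and
$\mathbb{P}(|\theta_t - q| > \epsilon_t \mid \mathcal{Q})$ denotes the
probability that $\theta_t$ (which is random) is very far away from $q$ (which
is fixed).

Now let $\epsilon_t$ be set as in Algorithm 1, so that the right-hand side of
(23) is at most $\delta$. Then, since equation (23) holds for any
$\epsilon_t > 0$, we get that the probability of keeping a malicious node
$i \in \mathcal{Q}$ in the network is at most $\delta$:

\[
  \mathbb{P}(|\theta_t - q| > \epsilon_t \mid \mathcal{Q}) \le \delta.
  \tag{24}
\]

Thus, we have:

\[
  \mathbb{E}[L \mid \mathcal{Q}]
    = \mathbb{E}[N \mid \mathcal{Q}] \cdot \ell_Q
    = \sum_{t=0}^{\infty} \mathbb{P}(N = t \mid \mathcal{Q}) \cdot t \cdot \ell_Q .
\]

Thus:

\[
  \mathbb{E}[L \mid \mathcal{Q}]
    = \ell_Q \sum_{t=0}^{\infty} \mathbb{P}(N = t \mid \mathcal{Q}) \cdot t
    \le \ell_Q \sum_{t=0}^{\infty} \delta^{t-1} \cdot t
    = \frac{\ell_Q}{(1 - \delta)^2} .
\]

□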
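
As a quick numerical sanity check of the last step of this proof, the following
sketch compares a truncated version of the series $\sum_t t\,\delta^{t-1}$ with
the closed form $1/(1-\delta)^2$. The function names and the truncation length
are illustrative assumptions and are not part of the original analysis.

```python
def truncated_series(delta: float, terms: int = 10_000) -> float:
    """Partial sum of sum_{t>=0} t * delta**(t-1), truncated at `terms`."""
    # The t = 0 term is zero, so the sum can start at t = 1.
    return sum(t * delta ** (t - 1) for t in range(1, terms))


def closed_form(delta: float) -> float:
    """Closed form 1 / (1 - delta)**2, valid for 0 <= delta < 1."""
    return 1.0 / (1.0 - delta) ** 2


if __name__ == "__main__":
    # Hypothetical values of delta, chosen only for illustration.
    for delta in (0.01, 0.1, 0.5):
        print(delta, truncated_series(delta), closed_form(delta))
```

Multiplying either quantity by the per-step cost of a malicious node gives the
corresponding bound on the expected loss.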



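Proof (of the lemma for honest nodes). We denote by $N$ the time-step at which
the algorithm removes node $i$ from the network. Then, the function
$g : \mathbb{N}^2 \to \mathbb{R}$ that gives us the gain for each node $i$ is
defined as:

\[
  g(n, h) = \min\{n, h\} \cdot g_u ,
  \tag{25}
\]

where $h$ is a value of the horizon $H$ and $n$ a value of $N$.

Since the node $i$ under consideration is honest, i.e. $i \in \mathcal{U}$, we
have $\mathbb{E}[x_{i,t} \mid \mathcal{U}] = u$. Without loss of generality we
assume that $u = q + \Delta$, where $\Delta > 0$. So we only need
$\mathbb{P}(\theta_t - q < \epsilon_t \mid \mathcal{U})$.

Since $q = u - \Delta$, from the Hoeffding inequality (Lemma 3 in the
Appendix), we have:

\[
  \mathbb{P}(N = t \mid \mathcal{U})
    \le \mathbb{P}(N \le t)
    \le \mathbb{P}(\theta_t - u < \epsilon_t - \Delta \mid \mathcal{U})
    \le \exp\bigl(-2 t (\epsilon_t - \Delta)^2\bigr),
\]

where $\Delta - \epsilon_t > 0$. It holds that:

\[
  \mathbb{E}[G \mid \mathcal{U}, N = n]
    = \sum_{h=0}^{\infty} \mathbb{P}(H = h \mid \mathcal{U}, N = n)\,
      \mathbb{E}[G \mid \mathcal{U}, N = n, H = h] .
  \tag{26}
\]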
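But it holds that:

\[
  \mathbb{E}[G \mid \mathcal{U}, N = n, H = h] = g(n, h),
\]

and since $H$ and $N$ are independent we have:

\[
  \mathbb{P}(H = h \mid \mathcal{U}, N = n) = \mathbb{P}(H = h \mid \mathcal{U}).
\]

Thus,

\[
\begin{aligned}
  \mathbb{E}[G \mid \mathcal{U}, N = n]
    &= \sum_{h=0}^{\infty} \mathbb{P}(H = h \mid \mathcal{U}) \cdot g(n, h) \\
    &= \sum_{h=0}^{\infty} \mathbb{P}(H = h \mid \mathcal{U}) \min\{n, h\} \cdot g_u \\
    &= g_u \cdot \Bigl\{ \sum_{h=0}^{n-1} \mathbb{P}(H = h \mid \mathcal{U}) \cdot h
       + \sum_{h=n}^{\infty} \mathbb{P}(H = h \mid \mathcal{U}) \cdot n \Bigr\} .
\end{aligned}
\tag{27}
\]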

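The expected loss is the expected gain of the oracle policy, which never
removes the node from the network (i.e. $N = \infty$), minus the expected gain
when the algorithm removes the node at time-step $N = n$. Thus, it holds:

\[
\begin{aligned}
  \mathbb{E}[L \mid \mathcal{U}, N = n]
    &= \mathbb{E}[G \mid \mathcal{U}, N = \infty]
       - \mathbb{E}[G \mid \mathcal{U}, N = n] \\
    &= \lim_{m \to \infty} \mathbb{E}[G \mid \mathcal{U}, N = m]
       - \mathbb{E}[G \mid \mathcal{U}, N = n] \\
    &= g_u \sum_{h=0}^{\infty} \mathbb{P}(H = h) \cdot h
       - g_u \Bigl\{ \sum_{h=0}^{n-1} \mathbb{P}(H = h) \cdot h
       + \sum_{h=n}^{\infty} \mathbb{P}(H = h) \cdot n \Bigr\} \\
    &= g_u \Bigl\{ \sum_{h=n}^{\infty} \mathbb{P}(H = h) \cdot h
       - \sum_{h=n}^{\infty} \mathbb{P}(H = h) \cdot n \Bigr\} .
\end{aligned}
\tag{28}
\]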

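Since, by definition, $\mathbb{P}(H = h + 1 \mid H > h) = \lambda$, we have
$\mathbb{P}(H = h) = (1 - \lambda)^{h-1} \lambda$. Consequently,

\[
  \mathbb{E}[L \mid \mathcal{U}, N = n]
    = g_u \cdot \lambda \Bigl( \sum_{h=n}^{\infty} (1 - \lambda)^{h-1} \cdot h
      - n \sum_{h=n}^{\infty} (1 - \lambda)^{h-1} \Bigr)
    = g_u \frac{(1 - \lambda)^n}{\lambda} .
  \tag{29}
\]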

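Thus, we have:

\[
  \mathbb{E}[L \mid \mathcal{U}]
    = \sum_{t=0}^{\infty} \mathbb{P}(N = t \mid \mathcal{U})\,
      \mathbb{E}[L \mid \mathcal{U}, N = t]
    \le \sum_{t=0}^{\infty} \exp\bigl(-2 \cdot t \cdot (\epsilon_t - \Delta)^2\bigr)
      \cdot g_u \frac{(1 - \lambda)^t}{\lambda} .
\]

Since the algorithm uses $\epsilon_t = \Delta / \sqrt{t}$, we have:

\[
\begin{aligned}
  \mathbb{E}[L \mid \mathcal{U}]
    &\le \sum_{t=0}^{\infty}
       \exp\Bigl(-2 \cdot t \Bigl(\frac{\Delta}{\sqrt{t}} - \Delta\Bigr)^2\Bigr)
       \frac{g_u}{\lambda} (1 - \lambda)^t \\
    &= \frac{g_u}{\lambda} \sum_{t=0}^{\infty}
       \exp\bigl(-2 \Delta^2 (\sqrt{t} - 1)^2\bigr) (1 - \lambda)^t \\
    &\le \frac{g_u}{\lambda} \sum_{t=0}^{\infty}
       \exp\Bigl(-\frac{\Delta^2 t}{2}\Bigr) (1 - \lambda)^t \\
    &= \frac{g_u}{\bigl[1 - \exp\bigl(-\frac{\Delta^2}{2}\bigr)(1 - \lambda)\bigr] \lambda} \\
    &\le \frac{g_u (\Delta^2 + 2)}{\lambda (\Delta^2 + 2 \lambda)} ,
\end{aligned}
\tag{30}
\]

where $t > 2$.

□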
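
The two closed-form steps used in this proof can be checked numerically. The
sketch below is only an illustration under stated assumptions: the horizon is
taken to be geometric with $\mathbb{P}(H = h) = (1-\lambda)^{h-1}\lambda$, the
per-step factor of the geometric series is assumed to be
$\exp(-\Delta^2/2)(1-\lambda)$, and all function and parameter names are
hypothetical. It checks that the truncated sum behind (29) matches
$g_u (1-\lambda)^n / \lambda$, and that the geometric-series expression is no
larger than the final bound in (30), which follows from $e^{-x} \le 1/(1+x)$.

```python
import math


def loss_given_removal(n: int, lam: float, g_u: float = 1.0,
                       terms: int = 5_000) -> float:
    """Truncated g_u * sum_{h>=n} P(H=h) * (h - n), with geometric P(H=h)."""
    return g_u * sum((1 - lam) ** (h - 1) * lam * (h - n)
                     for h in range(n, n + terms))


def closed_form_loss(n: int, lam: float, g_u: float = 1.0) -> float:
    """Closed form g_u * (1 - lam)**n / lam, as in equation (29)."""
    return g_u * (1 - lam) ** n / lam


def geometric_series_bound(gap: float, lam: float, g_u: float = 1.0) -> float:
    """g_u / ([1 - exp(-gap^2 / 2) * (1 - lam)] * lam); `gap` plays the role of Delta."""
    ratio = math.exp(-gap ** 2 / 2) * (1 - lam)  # assumed per-step factor
    return g_u / ((1 - ratio) * lam)


def final_bound(gap: float, lam: float, g_u: float = 1.0) -> float:
    """Final expression g_u * (gap^2 + 2) / (lam * (gap^2 + 2 * lam))."""
    return g_u * (gap ** 2 + 2) / (lam * (gap ** 2 + 2 * lam))


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only for illustration.
    print(loss_given_removal(3, 0.1), closed_form_loss(3, 0.1))
    print(geometric_series_bound(0.3, 0.1), final_bound(0.3, 0.1))
```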



B Additional results 

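Definition 1 (Bernoulli distribution). If $X_1, \ldots, X_n$ are independent
Bernoulli random variables with $X_k \in \{0, 1\}$ and
$\mathbb{P}(X_k = 1) = \mu$ for all $k$, then

\[
  \mathbb{P}\Bigl(\sum_{k=1}^{n} X_k \ge u\Bigr)
    = \sum_{k=u}^{n} \binom{n}{k} \mu^k (1 - \mu)^{n-k} .
  \tag{31}
\]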



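Lemma 3 (Hoeffding). For independent random variables $X_1, \ldots, X_n$ such
that $X_i \in [a_i, b_i]$, with $\mu_i = \mathbb{E} X_i$ and $t > 0$:

\[
  \mathbb{P}\Bigl(\sum_{i=1}^{n} X_i \ge \sum_{i=1}^{n} \mu_i + n t\Bigr)
    \le \exp\Bigl(-\frac{2 n^2 t^2}{\sum_{i=1}^{n} (b_i - a_i)^2}\Bigr) .
\]

The same inequality holds for
$\sum_{i=1}^{n} X_i \le \sum_{i=1}^{n} \mu_i - n t$.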
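
To illustrate how the two results above relate, the following sketch computes
the exact upper tail of a sum of independent Bernoulli variables and compares
it with the Hoeffding bound for variables in $[0, 1]$, which in terms of the
empirical mean reads $\exp(-2 n \epsilon^2)$. The function names and the
parameter values are illustrative assumptions, not taken from the paper.

```python
import math


def binomial_upper_tail(n: int, mu: float, u: int) -> float:
    """Exact P(sum_k X_k >= u) for n independent Bernoulli(mu) variables."""
    return sum(math.comb(n, k) * mu ** k * (1 - mu) ** (n - k)
               for k in range(u, n + 1))


def hoeffding_bound(n: int, eps: float) -> float:
    """Bound exp(-2 n eps^2) on P(mean - mu >= eps) for variables in [0, 1]."""
    return math.exp(-2 * n * eps ** 2)


if __name__ == "__main__":
    # Hypothetical setting: 100 observations, mean 0.5, threshold count 60.
    n, mu, u = 100, 0.5, 60
    eps = u / n - mu  # deviation of the empirical mean from mu
    print(binomial_upper_tail(n, mu, u), hoeffding_bound(n, eps))
```

The exact tail is never larger than the bound; the gap between the two shows
how conservative a threshold based on Hoeffding's inequality can be.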